Dynamic Warp Resizing in High-Performance SIMT

نویسندگان

Ahmad Lashgar

Amirali Baniasadi

Ahmad Khonsari

چکیده

—Modern GPUs synchronize threads grouped in a warp at every instruction. These results in improving SIMD efficiency and makes sharing fetch and decode resources possible. The number of threads included in each warp (or warp size) affects divergence, synchronization overhead and the efficiency of memory access coalescing. Small warps reduce the performance penalty associated with branch and memory divergence at the expense of a reduction in memory coalescing. Large warps enhance memory coalescing significantly but also increase branch and memory divergence. Dynamic workload behavior, including branch/memory divergence and coalescing, is an important factor in determining the warp size returning best performance. Optimal warp size can vary from one workload to another or from one program phase to the next. Based on this observation, we propose Dynamic Warp Resizing (DWR). DWR takes innovative microarchitectural steps to adjust warp size during runtime and according to program characteristics. DWR outperforms static warp size decisions, up to 1.7X to 2.28X, while imposing less than 1% area overhead. We investigate various alternative configurations and show that DWR performs better for narrower SIMD and larger caches.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Hardware-Software Integrated Solution for Improved Single-Instruction Multi-Thread Processor Efficiency

This thesis proposes using an integrated hardware-software solution for improving Single-Instruction Multiple-Thread branching efficiency. Unlike current SIMT hardware branching architectures, this hardware-software solution allows programmers the ability to fine tune branching behavior for their application or allow the compiler to implement a generic software solution. To support a wide range...

متن کامل

Warp-Level Parallelism: Enabling Multiple Replications In Parallel on GPU

Stochastic simulations need multiple replications in order to build confidence intervals for their results. Even if we do not need a large amount of replications, it is a good practice to speed-up the whole simulation time using the Multiple Replications In Parallel (MRIP) approach. This approach usually supposes to have access to a parallel computer such as a symmetric multiprocessing machine ...

متن کامل

Simty: a Synthesizable General-Purpose SIMT Processor

Simty is a massively multi-threaded processor core that dynamically assembles SIMD instructions from scalar multi-thread code. It runs the RISC-V (RV32-I) instruction set. Unlike existing SIMD or SIMT processors like GPUs, Simty takes binaries compiled for generalpurpose processors without any instruction set extension or compiler changes. Simty is described in synthesizable RTL. A FPGA prototy...

متن کامل

Multi-tier Dynamic Vectorization for Translating GPU Optimizations into CPU Performance

Developing high performance GPU code is labor intensive. Ideally, developers could recoup high GPU development costs by generating high-performance programs for CPUs and other architectures from the same source code. However, current OpenCL compilers for non-GPUs do not fully exploit optimizations in well-tuned GPU codes. To address this problem, we develop an OpenCL implementation that efficie...

متن کامل

Multi-Level Cache Resizing

Hardware designers are constantly looking for ways to squeeze waste out of architectures to achieve better power efficiency. Cache resizing is a technique that can remove wasteful power consumption in caches. The idea is to determine the minimum cache a program needs to run at near-peak performance, and then reconfigure the cache to implement this efficient capacity. While there has been signif...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

CoRR

دوره abs/1208.2374 شماره

صفحات -

تاریخ انتشار 2012

Dynamic Warp Resizing in High-Performance SIMT

نویسندگان

چکیده

منابع مشابه

A Hardware-Software Integrated Solution for Improved Single-Instruction Multi-Thread Processor Efficiency

Warp-Level Parallelism: Enabling Multiple Replications In Parallel on GPU

Simty: a Synthesizable General-Purpose SIMT Processor

Multi-tier Dynamic Vectorization for Translating GPU Optimizations into CPU Performance

Multi-Level Cache Resizing

عنوان ژورنال:

اشتراک گذاری